Direct Human-AI Comparison in the Animal-AI Environment

Artificial Intelligence is making rapid and remarkable progress in the development of more sophisticated and powerful systems. However, the acknowledgement of several problems with modern machine-learning approaches has prompted a shift in AI benchmarking away from task-oriented testing (such as Chess and Go) towards ability-oriented testing, in which AI systems are tested on their capacity to solve certain kinds of novel problems. The Animal-AI Environment is one such benchmark, which aims to apply the ability-oriented testing used in comparative psychology to AI systems. Here, we present the first direct human-AI comparison in the Animal-AI Environment, using children aged 6–10 (n = 52). We found that children of all ages were significantly better than a sample of 30 AIs across most of the tests we examined, and that they also performed significantly better than the two top-scoring AIs, “ironbar” and “Trrrrr,” from the Animal-AI Olympics Competition 2019. While children and AIs performed similarly on basic navigational tasks, AIs performed significantly worse on more complex cognitive tests, including detour tasks, spatial elimination tasks, and object permanence tasks, indicating that AIs lack several cognitive abilities that children aged 6–10 possess. Both children and AIs performed poorly on tool-use tasks, suggesting that these tests are challenging for both biological and non-biological machines.


Correlation coefficients
Three correlation coefficients were used: Kendall's Tau, Spearman's Rho, and Pearson's Product Moment Correlation Coefficient (PMCC). Croux and Dehon (2010) argue that each of these is uniformly most powerful under different distributional assumptions, so all three are used within the multiverse approach.
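As a minimal sketch of this multiverse over coefficients (assuming a data frame `scores` with numeric columns `x` and `y`, hypothetical names standing in for any pair of variables being correlated):

```r
# Compute all three correlation coefficients on the same pair of
# variables; `scores$x` and `scores$y` are hypothetical placeholders.
methods <- c("kendall", "spearman", "pearson")
multiverse <- lapply(methods, function(m) {
  ct <- cor.test(scores$x, scores$y, method = m)
  data.frame(method = m, estimate = unname(ct$estimate),
             p.value = ct$p.value)
})
do.call(rbind, multiverse)
```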

ANOVAs
A Mixed ANOVA was used to examine whether AIs and children differ in their performance on the tasks (between-subjects factor) and across the 10 levels (within-subjects factor). Accuracy was averaged across the four tasks of each level. Normality was checked with Shapiro-Wilk Tests and evaluated visually using Q-Q plots (see Appendix I). Homogeneity of Variance was tested with Levene's test, and Homogeneity of Covariance was tested with Box's M-Test. Sphericity was checked with Mauchly's test and corrected for using the Greenhouse-Geisser correction. Main effects of Level (L1-L10) and Agent (AI:Children) and the Level*Agent interaction effect were calculated. Generalised eta-squared (ηg²) effect sizes were reported, with ηg² of 0.2 or above considered to be a large effect size (Lakens, 2013). The Aligned-Rank Transform (ART; Wobbrock et al., 2011; Kay and Wobbrock, 2020) was used to facilitate a non-parametric analysis. This permits a Mixed ANOVA, specifically Type III Wald F tests with Kenward-Roger degrees of freedom, to be performed whilst avoiding the distributional assumptions of Normality and Sphericity. The equivalent effect-size metrics (ηg² or omega-squared, ω²; Olejnik and Algina, 2003) were not available for ART transforms.
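A rough sketch of the ART analysis, assuming a long-format data frame `df` with columns `accuracy`, `agent`, `level`, and participant `id` (all hypothetical names), using the ARTool R package that implements Wobbrock et al.'s procedure:

```r
library(ARTool)

# Aligned-Rank Transform mixed ANOVA: agent is between-subjects,
# level is within-subjects, with a random intercept per participant.
df$agent <- factor(df$agent)
df$level <- factor(df$level)
df$id    <- factor(df$id)

m <- art(accuracy ~ agent * level + (1 | id), data = df)

# For lmer-based art models, anova() reports F tests of the aligned
# ranks; ARTool uses Kenward-Roger degrees of freedom by default.
anova(m)
```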

Normality checks
Generally, the distributions for each level by agent are ambiguous in terms of Normality, forming a diverse and heterogeneous set of distributions. Shapiro-Wilk Tests for Normality were also performed, testing the null hypothesis that each of the distributions of average accuracy by level is normal. The results are in Table 9. Most of the distributions depart significantly from Normality. However, not all do, so multiversing with both parametric and non-parametric testing is justified. Normality checks were not multiversed for outliers.
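A sketch of the per-cell check (same hypothetical `df` layout as above):

```r
# Shapiro-Wilk test of average accuracy within each agent-by-level cell;
# the null hypothesis is that the cell's distribution is normal.
cells <- split(df$accuracy, interaction(df$agent, df$level))
p_values <- sapply(cells, function(x) shapiro.test(x)$p.value)
round(p_values, 4)  # p < 0.05 indicates a significant departure
```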

Levene's Tests for Homogeneity of Variance
Levene's test tests the null hypothesis that there is homogeneity of variance between the two samples (between-subjects factor). If p<0.05, then we reject the null hypothesis, meaning that there is not homogeneity of variance; this violates the homoscedasticity assumption of ANOVAs. [Table of Levene's test statistics by level; recoverable values: 22.5***, 6.28***. *p<.05, **p<.01, ***p<.001.] Significance was affected by outliers: for agent contrasts, L2 became significant; for age-group contrasts, L2 and L6 also became significant.
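A sketch using leveneTest() from the car package (an assumption; the text does not name the implementation):

```r
library(car)

# Levene's test of equal variances between agents, run level by level.
# `df`, `accuracy`, `agent`, and `level` are hypothetical names.
df$agent <- factor(df$agent)
by(df, df$level, function(d) leveneTest(accuracy ~ agent, data = d))
```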

Box's M-test for Homogeneity of Covariance
Box's M-test tests the null hypothesis that there is homogeneity of covariance between the two samples (between-subjects factor). Box's M-test is highly sensitive; Tabachnick and Fidell (2001) suggest that heterogeneity of covariance should only be acknowledged and corrected for if sample sizes are unequal and p<0.001. For the mixed ANOVA for agent contrasts, M(1)=4.09, p=0.0431. For the mixed ANOVA for age contrasts, M(5)=7.57, p=0.181. When outliers were removed, the significance for the agent-contrast mixed ANOVA was unaffected (M(1)=5.44, p=0.0197). However, when outliers were removed for the age contrasts, significance was affected (M(5)=21.1, p=0.000762). Paired with unequal sample sizes, this means that the age-contrast mixed ANOVA without outliers is not robust.
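One way to run this check (a sketch using boxM() from the heplots package, an assumption; the original analysis may have used another implementation):

```r
library(heplots)
library(tidyr)

# Box's M needs one multivariate observation per participant, so
# reshape to wide format: one column of average accuracy per level.
# `df`, `id`, `agent`, `level`, `accuracy` are hypothetical names.
wide <- pivot_wider(df, id_cols = c(id, agent),
                    names_from = level, values_from = accuracy)
level_cols <- setdiff(names(wide), c("id", "agent"))

# Null hypothesis: covariance matrices are equal across agent groups.
boxM(as.matrix(wide[, level_cols]), wide$agent)
```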

Sphericity and Greenhouse-Geisser correction
Sphericity was tested for automatically, and corrected using the Greenhouse-Geisser method, by the ANOVA function in R.

Two-sample Comparisons
To examine whether the AIs and children differ on the individual levels, Mann-Whitney U Tests were used for the data on each level, using the Bonferroni correction for multiple comparisons (i.e., using an α of 0.05/10 = 0.005 for each test). Vargha and Delaney's A (VDA) was used as the measure of effect size for these tests; Welch's Two-Sample t-test (with Cohen's d) was used as the parametric alternative. VDA is given on a scale of 0 to 1, with 0.5 meaning that the two groups are equal (each has equal stochastic dominance over the other). Values closer to 1 mean that sample 1 has greater stochastic dominance over sample 2, and vice versa as VDA approaches 0. To examine how individual age groups compare with AIs, parametric and non-parametric Two-Way Mixed ANOVAs were run and t-ratio contrast effects calculated, using Kenward-Roger degrees of freedom and the Tukey correction for multiple comparisons ([emmeans] R package, Lenth et al., 2020).
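A sketch of the per-level two-sample comparisons (hypothetical `df` as above; VD.A() from the effsize package is an assumption, as the implementation is not named in the text):

```r
library(effsize)

alpha <- 0.05 / 10  # Bonferroni-corrected threshold across 10 levels

per_level <- lapply(split(df, df$level), function(d) {
  w  <- wilcox.test(accuracy ~ agent, data = d)  # Mann-Whitney U
  a  <- VD.A(accuracy ~ agent, data = d)         # Vargha-Delaney A
  tt <- t.test(accuracy ~ agent, data = d)       # Welch's t (default)
  data.frame(U = unname(w$statistic), p.MWU = w$p.value,
             sig = w$p.value < alpha, VDA = a$estimate,
             t = unname(tt$statistic), p.Welch = tt$p.value)
})
do.call(rbind, per_level)
```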

Clustering analysis
The Partitioning Around Medoids algorithm (PAM; Kaufman and Rousseeuw, 1990) was used. PAM partitions the data into k clusters by minimising the distances between each data point and the medoid of its cluster. The distance metric can either be the root of the sum of squared differences (Euclidean) or the sum of absolute differences (Manhattan). The Manhattan metric is more robust to outliers and so is used here. The average silhouette method is used to estimate the optimal number of clusters in a dataset, by computing PAM for various values of k (clusters) and determining the quality of those clusters in terms of the distance metric. Quality is defined by 'average silhouette width': for each data point, the silhouette width compares the average distance to the points in its own cluster with the average distance to the points in the nearest other cluster. It is measured from -1 to 1, with 1 indicating tight clustering of the data points, -1 indicating that data points should be classified as being in different clusters, and 0 indicating that data points are on average equidistant from all clusters, suggesting a non-clustered distribution (Rousseeuw, 1987). Values are reported from 0 to 1 since PAM does not generate negative silhouette widths. Strong clustering is suggested by an average silhouette width of at least 0.75, medium clustering by a width of at least 0.5, and weak clustering by a width of at least 0.25 (ibid.). Data were collected on how many hours of video gaming the children engaged in per week, which kinds of video games these were, and which controllers were used. Kendall's tau was used to determine the correlation between the number of hours played and the output of the k-medoids clustering analysis.
There was no significant correlation between number of hours played and the clusters (rτ = -0.1807, z=-1.4495, p=0.1472). This was not multiversed.
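The clustering step itself might look like the following sketch, using the [cluster] package cited below and assuming a numeric matrix `task_scores` (a hypothetical name) with one row per child and one column per task:

```r
library(cluster)

# Average silhouette width for candidate numbers of clusters k,
# using PAM with the outlier-robust Manhattan metric.
widths <- sapply(2:6, function(k)
  pam(task_scores, k, metric = "manhattan")$silinfo$avg.width)
names(widths) <- 2:6
widths  # compare against the 0.25 / 0.5 / 0.75 bands

# Fit PAM at the best k (k = 2 here is illustrative only).
fit <- pam(task_scores, k = 2, metric = "manhattan")
fit$clustering  # cluster assignment per child
```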
The phi coefficient, implemented using the [sjstats] R package (Lüdecke, 2020), was used to determine the association between clusters and binary responses to questions about game type and controller type. Fisher's Exact Test was used to calculate the significance of the coefficients. Significance levels are applied after Bonferroni correction. The results are presented in Table 9.
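A sketch of one such association test (hypothetical vectors `cluster` and `uses_gamepad`; phi() is from the sjstats package cited above):

```r
library(sjstats)

# 2 x 2 contingency table of cluster membership against a binary
# questionnaire response; both vectors are hypothetical placeholders.
tab <- table(cluster, uses_gamepad)

phi(tab)          # phi coefficient of association
fisher.test(tab)  # exact p-value; compare to Bonferroni-adjusted alpha
```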

MANOVAs with 'ironbar' and 'Trrrrr'
Both 'ironbar' and 'Trrrrr' were first individually compared to children in terms of the percentiles they occupied with respect to the children's performances (see main text). They were then compared using one-sample Hotelling's T² tests across all 40 tasks, using both the χ²-distribution and the F-distribution ([DescTools] R package, Signorell et al., 2020). A non-parametric equivalent, Hallin and Paindaveine's (2002a, b) Multivariate Signed-Rank Test (with Tyler angles), was run using various settings for the pseudo-Mahalanobis distance ('rank', 'sign', or 'normal') and for the p-value computation method (approximation, or bootstrapping with 1000 permutations) ([ICSNP] R package, Nordhausen, Sirkia, Oja, and Tyler, 2018). Using the F-distribution for the Hotelling's test enabled the computation of Bonferroni and simultaneous confidence intervals, adjusted for family-wise error rate, for post hoc comparisons on a task-by-task basis. The simultaneous confidence intervals were too conservative, resulting in impossible values being included in the 95% confidence interval, so Bonferroni confidence intervals were used. The assumption of Normality was violated in most of these cases by the child sample, but there is evidence to suggest that this kind of analysis has some robustness to such violations (Finch, 2005). Correlation coefficients by level and by task were also generated for 'Trrrrr' and 'ironbar' individually.

All analyses were conducted in R. All plots were generated using the [ggplot2] R package (Wickham, 2016) unless otherwise stated. Clustering was performed using the [cluster] (Maechler et al., 2019) and [factoextra] (Kassambara and Mundt, 2020) R packages. UMAP was performed with the [umap] R package (Konopka, 2020), and the ShinyApp was created using the [shiny] and [shinydashboard] packages (Chang, Cheng, et al., 2020; Chang, Borges Ribeiro, et al., 2020).
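A sketch of the two multivariate tests, assuming `child_scores` (a hypothetical n-by-40 matrix of children's per-task scores) and `ai_vec` (a hypothetical length-40 vector of one AI's scores):

```r
library(DescTools)
library(ICSNP)

# One-sample Hotelling's T2: do the children's mean task scores
# differ from the AI's score vector, across all 40 tasks?
HotellingsT2Test(child_scores, mu = ai_vec, test = "f")    # F-distribution
HotellingsT2Test(child_scores, mu = ai_vec, test = "chi")  # chi-squared

# Hallin-Paindaveine multivariate signed-rank test with Tyler angles,
# multiversed over score functions and p-value computation methods.
HP.loc.test(child_scores, mu = ai_vec, score = "rank",
            angles = "tyler", method = "approximation")
HP.loc.test(child_scores, mu = ai_vec, score = "sign",
            angles = "tyler", method = "permutation", n.simu = 1000)
```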

Outline of 'Get the Fruit!' game
All images of the Animal-AI Environment and Testbed below are licensed under Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).

Dashboard information and classes of objects in the game:
1. Configuration = task code. Reward = number of 'points' accrued. This starts at 0 and goes down with time, shown by a green time bar and percentage. The bar turns red when 20% or less of the time is left, acting as an exogenous cue. Points go up by 1 when a yellow 'fruit' (not shown, see below) is retrieved.
2. Blue platforms. There is no way to ascend onto these unless a ramp is available (7). In some tasks, the participant begins on a blue platform; once they move off it, they cannot go back, enabling forced-choice tasks to be generated.
3. Green 'fruit' is the target of the level. It needs to be retrieved within the time limit, and after all yellow 'fruit' have been retrieved, in order to pass. All 'fruit' can be stationary OR moving, at varying directions, speeds, and accelerations.
4. Red 'fruit' should be avoided. It is 'poisonous': touching it causes immediate failure of the level.
5. 'Cardboard boxes' are pushable and move around easily.
6. Pushable blocks take more effort to push.
7. Ramps allow movement in the third dimension. They are always pink and cannot be moved.
8. Other obstacles are opaque or transparent, and can be of varying sizes and shapes, including tunnels.
Follow this link for the tutorial video presented to the children: https://www.youtube.com/watch?v=oA9WHMPAONM. You may also play the full game presented to the children by contacting the corresponding author. There are two stages to the game: Tutorial levels (15 tasks; humans only) and Test levels (40 tasks; both humans and AIs). Tutorial tasks were created to introduce the rules and parameters of the game. Test tasks were randomly selected from the 900 available in the Animal-AI Olympics test battery: 4 tasks were randomly selected for each level, and then, for the human participants, one of the three variants of each was selected. The numbering system below refers to the configuration. For example, '2-29-1' refers to the second level, the 29th task, and its 1st variant in the AAI test battery.
Tutorial 0-1-1 This is an empty arena so participants can practice the controls (arrow keys/WASD). All participants automatically pass.
Tutorial 0-2-1 Simple retrieval of green 'fruit' from simple starting position. Participant passes if they successfully get it within the time limit.
Tutorial 0-2-2 This level introduces yellow 'fruit', which give you 1 point each. It also explains that if there is no green 'fruit' available, you should collect as many yellow 'fruit' as you can. Participants pass if they collect both 'fruit' within the time limit.
Tutorial 0-2-3 This level introduces 'fruit' of different sizes, explaining that larger 'fruit' are more preferable. Participants only pass if they retrieve the larger green 'fruit'. This level synthesises the previous two, explaining that to pass the participant must retrieve the maximum number of points possible, in this case, collecting the two yellow 'fruit' before the large green one.

Tutorial 0-3-1 This level explains that the aim is to get the maximum points possible. Here, there are no green 'fruit', so success is determined by selecting the side containing 2 yellow 'fruit'. It also outlines the role of blue platforms: these cannot be re-ascended (unless there is a pink ramp), resulting in forced-choice tasks like this one.
Tutorial 0-4-1 This level allows the participant to explore the different kinds of stationary objects they might encounter, including pink ramps of different heights and opaque vs. transparent obstacles of varying sizes. The 'lights go out' periodically, to expose participants to this possibility, in which there is no visual feedback for a few timesteps. All participants automatically pass.
Tutorial 0-5-1 This level allows the participant to explore the 'dangers' of the game. Orange 'hot zones' accelerate point decrement per time step, so they shouldn't be stepped on for very long. Red 'lava zones' and 'poisonous red fruit' cause immediate failure if they are touched. Participants pass if they do not touch the red objects and don't spend too long in the orange zones.
Tutorial 0-6-1 This level allows the participant to explore the different kinds of pushable object, namely the 'boxes' and the pushable blocks, helping them generate an understanding of their affordances. All participants automatically pass.
Tutorial 0-7-1 This level includes a random arrangement of the objects previously introduced. The random generator is seeded so every participant views the same random arrangement. The participant passes as long as they do not touch the red 'fruit'.
Test 3-21-1 Pass criterion: to pass the participant must ascend the ramps in order to obtain the partly visible green 'fruit' on the third platform, within the time limit.
Test 3-18-1 Pass criterion: to pass the participant must ascend the ramp and navigate across the platform in order to obtain the green 'fruit' within the time limit.
Test 4-13-1 Pass criterion: free-choice T-maze. To pass the participant must navigate around the 'lava zone' and retrieve the green 'fruit' within the time limit.
Test 4-22-1 Pass criterion: to pass the participant must navigate across the 'bridge' to retrieve the green 'fruit' within the time limit.
Test 5-15-1 Pass criterion: the green 'fruit' is perched on a pole and is unreachable. The participant must push the 'box' towards it to knock the 'fruit' off so that they can retrieve it, within the time limit.
Test 5-9-1 Pass criterion: forced-choice spatial elimination task. The participant must select the right side and navigate behind the barrier to obtain the green 'fruit' within the time limit.
Test 5-24-1 Pass criterion: 4-arm radial arm maze. To pass the participant must retrieve all 4 yellow 'fruit' within the time limit.
Test 5-26-1 Pass criterion: 6-arm radial arm maze. To pass the participant must retrieve all 6 yellow 'fruit' within the time limit.
Test 6-9-1 Pass criterion: participant must navigate within the oddly coloured obstacle-less environment to retrieve the green 'fruit' (not visible) within the time limit.
Test 6-12-2 Pass criterion: Inverted Y-maze variant. To pass the participant must navigate around the barrier (shaped like a fence so that the goal is visible through it), to obtain the green 'fruit' within the time limit.
Test 7-16-1 Pass criterion: participant must navigate around the 'lava zone' to retrieve the green 'fruit' within the time limit. The 'lights go out', meaning all visual feedback is withheld, periodically for a few timesteps.
Test 7-17-1 Pass criterion: participant must navigate around the 'lava zone' to retrieve the green 'fruit' within the time limit. The 'lights go out', meaning all visual feedback is withheld, periodically for a few timesteps.
Test 7-22-1 Pass criterion: the participant must obtain the green 'fruit' that falls down the runway, which includes a right-angled kink, within the time limit.
Test 7-25-1 Pass criterion: the participant must obtain all yellow 'fruit' within the time limit. The 'lights go out' periodically and remain off towards the end of the level.
Test 8-30-1 Explanation and pass criterion: similar to 8-19-1, except that as the 'lights go out', the opaque barriers on both sides start to drop. When the lights come back on, the green 'fruit' is not visible: the left barrier is resting on the shorter blocks, visibly too close to the arena floor to be hiding a green 'fruit', and the right barrier is resting on the taller blocks, far enough from the arena floor to hide one. To pass, the participant must choose the right side and navigate around the barrier to retrieve the green 'fruit', within the time limit.
Test 8-11-1 Pass criterion: the green 'fruit' visibly drops into the hole pictured. The participant must navigate to the correct hole and drop down to obtain the green 'fruit' within the time limit.
Test 9-21-1 Explanation and pass criterion: forced-choice numerosity task. The three yellow 'fruit' roll rightwards and drop behind the right barrier. To pass, the participant must choose the right side and navigate behind the barrier to collect the three yellow 'fruit', within the time limit.
Test 9-24-1 Explanation and pass criterion: forced-choice numerosity task. The three yellow 'fruit' on the right roll leftwards and drop behind the left barrier. The one yellow 'fruit' on the left rolls rightwards and drops behind the right barrier. To pass, the participant must choose the left side and navigate behind the barrier to collect the three yellow 'fruit', within the time limit.